Assessing the Impact of Socio-economic Factors on Presidential Election Voting in the USA in 2016
Author
Affiliation
Zilu Wang
University of Glasgow
1 Introduction
The 2016 USA presidential election is one of the most dramatic and surprising elections in the US history. As a result, Republican nominee Donald Trump won the presidency with 304 electoral votes, compared to Democratic nominee Hillary Clinton’s 227 electoral votes. Trump became the first president without prior political or military experience. Additionally, he became the fifth president to win the presidency despite losing the popular vote, given that he received almost 3 million votes less than Hillary Clinton.
Donal Trump was known for his controversial statements and policies, which appealed to a significant portion of the American population. His victory in the 2016 election raised questions about the factors that influenced the voting patterns of the American electorate. In this analysis, we aim to explore the relationship between socio-economic factors and voting patterns in the 2016 presidential election. We will investigate how demographic, economic, and educational indicators at the county level may have influenced the voting outcomes in the election.
For better understanding of this paper, a brief summary of how US election works is provided. The U.S. presidential election is an indirect election where voters cast ballots for a slate of members of the Electoral College; these electors then directly vote for the President and Vice President. Each state is allocated a number of electors equal to its total number of Senators and Representatives in Congress, resulting in a total of 538 electors in the Electoral College. Most states have a “winner-takes-all” system where the candidate who receives the most popular votes in that state wins all its electoral votes. A candidate needs a majority of 270 electoral votes to win the presidency. The Electoral College system has been a subject of debate, as it is possible for a candidate to win the presidency without winning the popular vote, as was the case in the 2016 election.
The main datasets used in this project include the 2016 US presidential election results data, and the 2014 socio-economic data from the US Census Bureau. The election results dataset contains information on the number of votes received by each candidate in each county, the total number of votes, and the fraction of votes received by the Republican candidate. The demographic and socio-economic indicators dataset contains information on various indicators such as population, racial demographics, educational attainment, and median household income for each county. The dictionary dataset provides descriptions of the columns in the demographic and socio-economic indicators dataset.
There are three main research questions that we aim to address in this analysis: 1. Are there specific socio-economic or demographic factors that are associated with an increased or decreased preference for a political party, in a county? 2. Are there state-wide factors that are associated with a preference for one political party over another? 3. How well can the model associating socioeconomic factors with 2016 election results be used to predict the final state-wide outcome of the presidential elections in 2016?
The analysis will be conducted in several stages, including data inspection and pre-processing, exploratory data analysis, and predictive modeling. Various statistical and machine learning techniques will be applied to explore the relationship between socio-economic factors and voting patterns in the 2016 presidential election. The results of this analysis will provide insights into the factors that may have influenced the voting outcomes in the 2016 election and contribute to a better understanding of the dynamics of US presidential elections.
2 Methodology
2.1 Exploratory Data Analysis
2.1.1 Data Inspection and Pre-processing
# A tibble: 6 × 9
state state.po county FIPS candidatevotesR candidatevotesD totalvotes
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Alabama AL Autauga 1001 18172 5936 24973
2 Alabama AL Baldwin 1003 72883 18458 95215
3 Alabama AL Barbour 1005 5454 4871 10469
4 Alabama AL Bibb 1007 6738 1874 8819
5 Alabama AL Blount 1009 22859 2156 25588
6 Alabama AL Bullock 1011 1140 3530 4710
# ℹ 2 more variables: fracvotesR <dbl>, partywonR <dbl>
There is no missing value in the PresElect2016R dataset. However, there are 51 missing values in the state_abbreviation column in the UScounty_facts dataset. Now we select these rows with missing values to see the detail.
We see that these rows with missing value in the column state_abbreviation contain the data of demographic and socio-econimic indicators for each state. We will separate the data into two datasets: one for the states and the other for the counties.
Next, we need to check the dimensions of the datasets.
[1] 3141 9
[1] 3143 53
We can see that these two datasets do not have the same number of rows. We need to find out if there exist counties listed in one file may or may not appear in the other.
# A tibble: 1 × 3
FIPS county state
<dbl> <chr> <chr>
1 36000 Kansas City Missouri
# A tibble: 3 × 3
fips county state_abbreviation
<dbl> <chr> <chr>
1 15005 Kalawao County HI
2 31103 Keya Paha County NE
3 51515 Bedford city VA
There are in total of 238,828,200 registered votes, with turnout rate of 54.8%, meaning 54.8% of people age 18 or above voted in this election. The Democratic party received almost 3 million more votes than the Republican party, with 65,840,274 votes (51.11%) compared to 62,977,826 votes (48.89%). Despite this, the Republican party won a significantly higher number of counties, with 2,634 counties (83.86%) favoring them, while only 507 counties (16.14%) favored the Democratic party. In terms of state victories, the Republicans secured 30 states (58.82%), while the Democrats won 21 states (41.18%). This resulted in the Republican party gaining more electoral votes, with 306 (56.69%) compared to the Democratic party’s 232 electoral votes (43.31%), ultimately leading to their victory in the 2016 Presidential Election.
2016 Presidential Election Voting Results by State
State
Votes Received
Vote Share
Winner
Republican
Democratic
Republican %
Democratic %
Alabama
1,318,250
729,547
64.37
35.63
Republican
Alaska
163,343
116,181
58.44
41.56
Republican
Arizona
1,252,401
1,161,167
51.89
48.11
Republican
Arkansas
684,872
380,494
64.29
35.71
Republican
California
4,483,810
8,753,788
33.87
66.13
Democratic
Colorado
1,202,484
1,338,870
47.32
52.68
Democratic
Connecticut
673,215
897,572
42.86
57.14
Democratic
Delaware
185,127
235,603
44.00
56.00
Democratic
District of Columbia
12,723
282,830
4.30
95.70
Democratic
Florida
4,617,886
4,504,975
50.62
49.38
Republican
Georgia
2,089,104
1,877,963
52.66
47.34
Republican
Hawaii
128,847
266,891
32.56
67.44
Democratic
Idaho
409,055
189,765
68.31
31.69
Republican
Illinois
2,146,015
3,090,729
40.98
59.02
Democratic
Indiana
1,557,286
1,033,126
60.12
39.88
Republican
Iowa
800,983
653,669
55.06
44.94
Republican
Kansas
671,018
427,005
61.11
38.89
Republican
Kentucky
1,202,971
628,854
65.67
34.33
Republican
Louisiana
1,178,638
780,154
60.17
39.83
Republican
Maine
334,945
354,718
48.57
51.43
Democratic
Maryland
943,169
1,677,928
35.98
64.02
Democratic
Massachusetts
1,090,893
1,995,196
35.35
64.65
Democratic
Michigan
2,279,543
2,268,839
50.12
49.88
Republican
Minnesota
1,322,951
1,367,716
49.17
50.83
Democratic
Mississippi
700,714
485,131
59.09
40.91
Republican
Missouri
1,594,511
1,071,068
59.82
40.18
Republican
Montana
279,240
177,709
61.11
38.89
Republican
Nebraska
495,501
284,454
63.53
36.47
Republican
Nevada
512,058
539,260
48.71
51.29
Democratic
New Hampshire
345,790
348,526
49.80
50.20
Democratic
New Jersey
1,601,933
2,148,278
42.72
57.28
Democratic
New Mexico
319,667
385,234
45.35
54.65
Democratic
New York
2,814,589
4,547,562
38.23
61.77
Democratic
North Carolina
2,362,631
2,189,316
51.90
48.10
Republican
North Dakota
216,794
93,758
69.81
30.19
Republican
Ohio
2,841,005
2,394,164
54.27
45.73
Republican
Oklahoma
949,136
420,375
69.30
30.70
Republican
Oregon
782,403
1,002,106
43.84
56.16
Democratic
Pennsylvania
2,970,733
2,926,441
50.38
49.62
Republican
Rhode Island
180,490
251,888
41.74
58.26
Democratic
South Carolina
1,155,389
855,373
57.46
42.54
Republican
South Dakota
227,721
117,458
65.97
34.03
Republican
Tennessee
1,522,925
870,695
63.62
36.38
Republican
Texas
4,685,047
3,877,868
54.71
45.29
Republican
Utah
515,231
310,676
62.38
37.62
Republican
Vermont
95,369
178,573
34.81
65.19
Democratic
Virginia
1,769,443
1,981,473
47.17
52.83
Democratic
Washington
1,221,747
1,742,718
41.21
58.79
Democratic
West Virginia
489,371
188,794
72.16
27.84
Republican
Wisconsin
1,404,440
1,381,823
50.41
49.59
Republican
Wyoming
174,419
55,973
75.71
24.29
Republican
First, let’s see what counties has the highest Republican vote share and Democratic vote share.
Top 5 Counties with Highest Republican Vote Share
County
State
Republican Vote Share
Roberts
Texas
0.946
King
Texas
0.937
Motley
Texas
0.920
Hayes
Nebraska
0.918
Shackelford
Texas
0.916
4 out of the 5 counties with highest Republican vote share are in Texas.
Top 5 Counties with Highest Democratic Vote Share
County
State
Democratic Vote Share
District of Columbia
District of Columbia
0.909
Bronx
New York
0.885
Prince George's
Maryland
0.881
Petersburg
Virginia
0.872
Claiborne
Mississippi
0.868
Average Socio-Economic Indicators of Counties by Favouring Party
Average Socio-Economic Indicators of States by Favouring Party
Favouring Party
Median House Value
Income per capita
Bachelor's Degree Pct
White Population Pct
Population Density
Women Pct
Republican
194,876.47
28,053.80
28.83
62.05
12.99
50.78
Democratic
NaN
NaN
NaN
NaN
NaN
NaN
Average Socio-Economic Indicators of Counties by Favouring Party
winning_party
HSG495213
INC910213
INC110213
MAN450207
WTN220207
RTN130207
RTN131207
AFN120207
BPS030214
PST120214
VET605213
LFE305213
HSG010214
HSG445213
HSD410213
HSD310213
BZA010213
BZA110213
BZA115213
NES010213
SBO001207
LND110210
POP060210
AGE135214
AGE295214
AGE775214
SEX255214
RHI125214
RHI225214
RHI325214
RHI425214
RHI525214
RHI625214
RHI725214
RHI825214
POP715213
POP645213
POP815213
EDU635213
EDU685213
HSG096213
PVY020213
SBO315207
SBO115207
SBO215207
SBO515207
SBO415207
SBO015207
total_population
Republican
194,876.47
28,053.80
53,530.27
104,672,676.49
81,848,755.22
76,816,930.51
13,387.69
12,035,210.43
20,516.94
3.10
416,936.84
23.75
2,626,611.37
66.43
2,266,866.98
2.56
146,830.45
2,318,946.14
1.69
451,090.59
534,481.65
69,253.05
384.40
6.25
23.08
14.51
50.78
77.36
13.22
1.25
5.43
0.23
2.51
17.37
62.05
84.86
12.99
20.80
85.95
28.83
26.19
15.36
6.95
0.87
5.59
0.14
8.04
28.55
318,857,056.00
Democratic
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
0.00
Average Socio-Economic Indicators of Counties by Favouring Party
winning_party
HSG495213
INC910213
INC110213
MAN450207
WTN220207
RTN130207
RTN131207
AFN120207
BPS030214
PST120214
VET605213
LFE305213
HSG010214
HSG445213
HSD410213
HSD310213
BZA010213
BZA110213
BZA115213
NES010213
SBO001207
LND110210
POP060210
AGE135214
AGE295214
AGE775214
SEX255214
RHI125214
RHI225214
RHI325214
RHI425214
RHI525214
RHI625214
RHI725214
RHI825214
POP715213
POP645213
POP815213
EDU635213
EDU685213
HSG096213
PVY020213
SBO315207
SBO115207
SBO215207
SBO515207
SBO415207
SBO015207
total_population
Democratic
198,993.44
25,872.56
49,758.27
4,715,100.42
5,481,569.52
4,420,280.44
12,331.48
799,118.33
1,106.17
1.66
19,295.33
23.25
139,508.72
64.37
123,092.09
2.63
8,543.14
140,287.74
0.96
26,892.84
31,355.84
1,840.07
1,111.98
6.38
22.97
13.22
51.06
70.48
16.96
1.15
8.25
0.32
2.84
23.10
50.18
84.47
18.82
29.28
85.50
33.36
34.70
15.73
9.23
0.68
7.97
0.16
10.95
29.57
174,148,371.00
Republican
118,008.53
23,187.74
45,210.49
792,693.60
397,561.91
641,172.78
9,840.60
77,485.80
185.56
0.24
4,382.83
22.93
24,187.95
73.66
20,360.90
2.51
1,203.26
16,908.64
0.62
3,594.23
4,484.11
988.08
96.63
6.07
23.21
16.04
50.45
85.64
8.73
1.35
2.06
0.13
2.10
10.48
76.35
85.29
6.12
10.87
86.46
23.58
16.41
14.93
3.00
0.51
1.56
0.01
3.09
23.72
144,707,786.00
2.1.3 Visualisation
2.1.3.1 Win margin for state
Figure 1
2.1.3.2 winning party map
Figure 2
There’s seems to be correlation between geographical location and voting pattern. For example, the central states tend to vote Republican, while the coastal states and some northern states tend to vote Democratic.
Divide states into regions and plot the voting results by region.
state region division
1 Connecticut Northeast New England
2 Maine Northeast New England
3 Massachusetts Northeast New England
4 New Hampshire Northeast New England
5 Rhode Island Northeast New England
6 Vermont Northeast New England
7 New Jersey Northeast Middle Atlantic
8 New York Northeast Middle Atlantic
9 Pennsylvania Northeast Middle Atlantic
10 Indiana Midwest East North Central
11 Illinois Midwest East North Central
12 Michigan Midwest East North Central
13 Ohio Midwest East North Central
14 Wisconsin Midwest East North Central
15 Iowa Midwest West North Central
16 Kansas Midwest West North Central
17 Minnesota Midwest West North Central
18 Missouri Midwest West North Central
19 Nebraska Midwest West North Central
20 North Dakota Midwest West North Central
21 South Dakota Midwest West North Central
22 Delaware South South Atlantic
23 District of Columbia South South Atlantic
24 Florida South South Atlantic
25 Georgia South South Atlantic
26 Maryland South South Atlantic
27 North Carolina South South Atlantic
28 South Carolina South South Atlantic
29 Virginia South South Atlantic
30 West Virginia South South Atlantic
31 Alabama South East South Central
32 Kentucky South East South Central
33 Mississippi South East South Central
34 Tennessee South East South Central
35 Arkansas South West South Central
36 Louisiana South West South Central
37 Oklahoma South West South Central
38 Texas South West South Central
39 Arizona West Mountain
40 Colorado West Mountain
41 Idaho West Mountain
42 Montana West Mountain
43 Nevada West Mountain
44 New Mexico West Mountain
45 Utah West Mountain
46 Wyoming West Mountain
47 Alaska West Pacific
48 California West Pacific
49 Hawaii West Pacific
50 Oregon West Pacific
51 Washington West Pacific
Average Vote Share by Region
Region
Republican Vote Share
Democratic Vote Share
Midwest
0.52
0.48
Northeast
0.43
0.57
South
0.54
0.46
West
0.42
0.58
Average Vote Share by Division
Division
Republican Vote Share
Democratic Vote Share
East North Central
0.50
0.50
East South Central
0.64
0.36
Middle Atlantic
0.43
0.57
Mountain
0.53
0.47
New England
0.40
0.60
Pacific
0.36
0.64
South Atlantic
0.50
0.50
West North Central
0.57
0.43
West South Central
0.58
0.42
Midwest region tends to vote Republican, while the Northeast regions tend to vote Democratic.
Now let’s further divide states into divisions and plot the voting results by division.
There does seem to be correlation between geographical location and voting pattern. For example, East North Central, East South Central, and West South Central regions tend to vote Republican, while the Pacific and New England regions tend to vote Democratic. However, there still exists divions where the voting pattern is not as clear-cut, such as South Atlantic and Mountain regions.
This suggests that adding a new variable for region or division could be useful in predicting voting outcomes based on socio-economic indicators.
Further into county level, we will plot the voting results by county.
# A tibble: 4 × 12
state state.po county fips votesR votesD totalvotes fracR partywonR fracD
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alaska AK Valde… 02261 2618 1226 3844 0.681 1 0.319
2 Alaska AK Wade … 02270 415 1180 1595 0.260 0 0.739
3 Missouri MO Kansa… 36000 24654 97735 122389 0.192 0 0.760
4 South D… SD Oglal… 46113 241 2510 2751 0.0830 0 0.864
# ℹ 2 more variables: frac_diff <dbl>, winning_party <chr>
Simple feature collection with 6 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -1921992 ymin: -2158795 xmax: 60832.6 ymax: -141943.4
Projected CRS: NAD27 / US National Atlas Equal Area
# A tibble: 6 × 5
fips abbr full county geom
* <chr> <chr> <chr> <chr> <MULTIPOLYGON [m]>
1 02063 AK Alaska Chugach Census Area (((-1476669 -2101298, -1469…
2 02066 AK Alaska Copper River Census Area (((-1457015 -2063407, -1443…
3 02158 AK Alaska Kusilvak Census Area (((-1921992 -2073996, -1919…
4 15005 HI Hawaii Kalawao County (((-529505.2 -1982971, -526…
5 31103 NE Nebraska Keya Paha County (((-16148.74 -222421.3, -16…
6 46102 SD South Dakota Oglala Lakota County (((-242200.2 -150471.8, -23…
After looking up information on the Alaska government website, we found that in 2019, the Valdez-Cordova county was ablished and replaced by the Chugach Census Area and the Copper River Census Area. Also, in 2015, the Wade Hampton Census Area was renamed to the Kusilvak Census Area, and the fips code was changed from 02270 to 02158. We will deal with this later.
Histograms of all numerical variables at the county level
From the histograms, we can see that most of the predictors are right-skewed, indicating that the majority of counties have lower values for these indicators. This is expected as socio-economic indicators such as median household income, educational attainment, and racial demographics tend to vary significantly across counties.
This also suggests that data transformation methods such as logarithm transformation may be necessary to address the skewness in the variables.
Looking at the response variable fracR, we can see that the distribution is not symmetric, with a peak around 0.74. This indicates that the Republican vote share is higher in most counties.
Compared to histograms for county level, there is reduced but still significant skewness for most indicators at the state level.
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
We noticed that Kansas City, Missouri is missing in the county_facts dataset. We will remove this row from the merged dataset.
Now, let’s look at the averaged socio-economic indicators at the county level for each party. We will first start with population:
Demographic:
Educational Attainment
Housing
housing-2
Employment-1
Employment-2
Sales:
Source Code
---title: "Assessing the Impact of Socio-economic Factors on Presidential Election Voting in the USA in 2016"author: name: Zilu Wang affiliation: University of Glasgownumber-sections: trueformat: html: embed-resources: true code-tools: true pdf: documentclass: article fontsize: 11pt geometry: margin=1in linestretch: 1.5 keep-tex: true toc: true toc-depth: 3 number-sections: true fig-caption: trueeditor_options: chunk_output_type: consoleexecute: echo: false eval: true warning: false message: false---# Introduction {#sec-intro}The 2016 USA presidential election is one of the most dramatic and surprising elections in the US history. As a result, Republican nominee Donald Trump won the presidency with 304 electoral votes, compared to Democratic nominee Hillary Clinton’s 227 electoral votes. Trump became the first president without prior political or military experience. Additionally, he became the fifth president to win the presidency despite losing the popular vote, given that he received almost 3 million votes less than Hillary Clinton.Donal Trump was known for his controversial statements and policies, which appealed to a significant portion of the American population. His victory in the 2016 election raised questions about the factors that influenced the voting patterns of the American electorate. In this analysis, we aim to explore the relationship between socio-economic factors and voting patterns in the 2016 presidential election. We will investigate how demographic, economic, and educational indicators at the county level may have influenced the voting outcomes in the election.For better understanding of this paper, a brief summary of how US election works is provided. The U.S. presidential election is an indirect election where voters cast ballots for a slate of members of the Electoral College; these electors then directly vote for the President and Vice President. Each state is allocated a number of electors equal to its total number of Senators and Representatives in Congress, resulting in a total of 538 electors in the Electoral College. Most states have a "winner-takes-all" system where the candidate who receives the most popular votes in that state wins all its electoral votes. A candidate needs a majority of 270 electoral votes to win the presidency. The Electoral College system has been a subject of debate, as it is possible for a candidate to win the presidency without winning the popular vote, as was the case in the 2016 election.The main datasets used in this project include the 2016 US presidential election results data, and the 2014 socio-economic data from the US Census Bureau. The election results dataset contains information on the number of votes received by each candidate in each county, the total number of votes, and the fraction of votes received by the Republican candidate. The demographic and socio-economic indicators dataset contains information on various indicators such as population, racial demographics, educational attainment, and median household income for each county. The dictionary dataset provides descriptions of the columns in the demographic and socio-economic indicators dataset.There are three main research questions that we aim to address in this analysis:1. Are there specific socio-economic or demographic factors that are associated with an increased or decreased preference for a political party, in a county?2. Are there state-wide factors that are associated with a preference for one political party over another?3. How well can the model associating socioeconomic factors with 2016 election results be used to predict the final state-wide outcome of the presidential elections in 2016?The analysis will be conducted in several stages, including data inspection and pre-processing, exploratory data analysis, and predictive modeling. Various statistical and machine learning techniques will be applied to explore the relationship between socio-economic factors and voting patterns in the 2016 presidential election. The results of this analysis will provide insights into the factors that may have influenced the voting outcomes in the 2016 election and contribute to a better understanding of the dynamics of US presidential elections.# Methodology {#sec-methodology}```{r import libraries, message=FALSE, warning=FALSE}# import libraries and datalibrary(tidyverse)library(ggplot2)library(gt)library(maps)library(mapdata)library(reshape2)library(usmap)library(gridExtra)library(broom)library(caret)library(corrplot)library(factoextra)library(rpart)library(rpart.plot)library(pROC)``````{r import datasets, message=FALSE, warning=FALSE}# read data files## Read datasetsPresElect2016R <- read_csv("PresElect2016R copy.csv")UScounty_facts <- read_csv("UScounty-facts.csv", locale = locale(encoding = "ISO-8859-1"))dictionary <- read_csv("UScounty-dictionary.csv")# Rename columns in PresElect2016Rvotes_by_county <- PresElect2016R |> rename( state = state, state.po = state.po, county = county, fips = FIPS, votesR = candidatevotesR, votesD = candidatevotesD, fracR = fracvotesR )#Add columns fracD, totalvotes, winning_party to the votes_by_county dataframe.votes_by_county <- votes_by_county |> mutate( fracD = votesD / totalvotes, frac_diff = abs(fracR - fracD), totalvotes = votesR + votesD, winning_party = ifelse(fracR > fracD, "Republican", "Democratic") )votes_by_state <- votes_by_county %>% group_by(state, state.po) %>% summarise( totalvotes = sum(totalvotes, na.rm = TRUE), votesR = sum(votesR, na.rm = TRUE), votesD = sum(votesD, na.rm = TRUE), fracR = votesR / totalvotes, fracD = votesD / totalvotes, frac_diff = abs(fracR - fracD), partywonR = ifelse(votesR > votesD, 1, 0), winning_party = if_else(fracR > fracD, "Republican", "Democratic") )state_facts <- UScounty_facts |> filter(is.na(state_abbreviation)) |> rename(state = area_name) |> select(-state_abbreviation, -fips)county_facts <- UScounty_facts |> filter(!is.na(state_abbreviation)) |> rename(county = area_name)state_facts$state <- ifelse(state_facts$state == "District Of Columbia", "District of Columbia", state_facts$state)merged_data_state <- merge(votes_by_state, state_facts, by = "state")merged_data_state$winning_party <- as.factor(merged_data_state$winning_party)# Merge votes_by_county with county_facts by fipsmerged_data_county <- left_join(votes_by_county, county_facts, by = "fips") merged_data_county <- merged_data_county |> filter(complete.cases(county.y)) |> select(-county.y, -state_abbreviation) |> rename(county = county.x)merged_data_county$winning_party <- as.factor(merged_data_county$winning_party)# merge merged_data_county with state_info by statemerged_data_county <- merge(merged_data_county, state_facts, by = "state")```## Exploratory Data Analysis {#sec-eda}### Data Inspection and Pre-processing {#sec-eda-inspection}```{r dataset overview}# print the first 5 rows of the PresElect2016R datasethead(PresElect2016R)# print the first 5 rows of the UScounty_facts datasethead(UScounty_facts)```The first thing we need to check is whether there is any missing data in the datasets.```{r missing data}# Check for missing values in the PresElect2016R datasetcolSums(is.na(PresElect2016R))# Check for missing values in the UScounty_facts datasetcolSums(is.na(UScounty_facts))```There is no missing value in the `PresElect2016R` dataset. However, there are 51 missing values in the `state_abbreviation` column in the `UScounty_facts` dataset. Now we select these rows with missing values to see the detail.```{r select rows with missing value}# select rows with missing values in the UScounty_facts datasetUScounty_facts[!complete.cases(UScounty_facts), ]```We see that these rows with missing value in the column `state_abbreviation` contain the data of demographic and socio-econimic indicators for each state. We will separate the data into two datasets: one for the states and the other for the counties.```{r county and state facts}glimpse(state_facts)glimpse(county_facts)```Next, we need to check the dimensions of the datasets.```{r dataset dimension}# Check the dimensions of the datasetsdim(PresElect2016R)dim(county_facts)```We can see that these two datasets do not have the same number of rows. We need to find out if there exist counties listed in one file may or may not appear in the other.```{r}# Find FIPS codes in PresElect2016R but not in UScounty_factsfips_in_election_not_in_facts <-setdiff(PresElect2016R$FIPS, county_facts$fips)# Find FIPS codes in UScounty_facts but not in PresElect2016Rfips_in_facts_not_in_election <-setdiff(county_facts$fips, PresElect2016R$FIPS)# Get county names for FIPS codes in PresElect2016R but not in UScounty_factscounties_in_election_not_in_facts <- PresElect2016R %>%filter(FIPS %in% fips_in_election_not_in_facts) %>%select(FIPS, county, state)# Get county names for FIPS codes in UScounty_facts but not in PresElect2016Rcounties_in_facts_not_in_election <- county_facts %>%filter(fips %in% fips_in_facts_not_in_election) %>%select(fips, county, state_abbreviation)# Print the Counties in PresElect2016R but not in UScounty_factsprint(counties_in_election_not_in_facts)print(counties_in_facts_not_in_election)```### Statistical Summary {#sec-eda-summary}```{r}# Rename columns in PresElect2016Rvotes_by_county <- PresElect2016R |>rename(state = state,state.po = state.po,county = county,fips = FIPS,votesR = candidatevotesR,votesD = candidatevotesD, fracR = fracvotesR )#Add columns fracD, totalvotes, winning_party to the votes_by_county dataframe.votes_by_county <- votes_by_county |>mutate(fracD = votesD / totalvotes,frac_diff =abs(fracR - fracD),totalvotes = votesR + votesD,winning_party =ifelse(fracR > fracD, "Republican", "Democratic") )votes_by_state <- votes_by_county %>%group_by(state, state.po) %>%summarise(totalvotes =sum(totalvotes, na.rm =TRUE),votesR =sum(votesR, na.rm =TRUE),votesD =sum(votesD, na.rm =TRUE),fracR = votesR / totalvotes,fracD = votesD / totalvotes,frac_diff =abs(fracR - fracD), partywonR =ifelse(votesR > votesD, 1, 0),winning_party =if_else(fracR > fracD, "Republican", "Democratic") )# Create a dataframe with the number of electoral seats for each stateelectoral_seats <- tibble::tibble(state =c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"),electoral_votes =c(9, 3, 11, 6, 55, 9, 7, 3, 3, 29, 16, 4, 4, 20, 11, 6, 6, 8, 8, 4, 10, 11, 16, 10, 6, 10, 3, 5, 6, 4, 14, 5, 29, 15, 3, 18, 7, 7, 20, 4, 9, 3, 11, 38, 6, 3, 13, 12, 5, 10, 3))votes_by_state <- votes_by_state %>%left_join(electoral_seats, by ="state")# Merge county level datasetmerged_data_county <-left_join(votes_by_county, county_facts, by =c("fips"="fips"))merged_data_county <- merged_data_county[complete.cases(merged_data_county), ]# Mutate the DC namestate_facts$state <-ifelse(state_facts$state =="District Of Columbia", "District of Columbia", merged_data_county$state)# Merge state level datasetmerged_data_state <-left_join(votes_by_state, state_facts, by =c("state"="state"))# select rows with missing valuemerged_data_state[!complete.cases(merged_data_state), ]``````{r summary table, message=FALSE, warning=FALSE}#| label: tbl-result-summary#| tbl_cap: Results Summary of 2016 US Presidential Election# Load necessary librarylibrary(dplyr)# Calculate total votes received by Republicans and Democratstotal_votes_R <- sum(votes_by_state$votesR)total_votes_D <- sum(votes_by_state$votesD)total_votes <- total_votes_R + total_votes_D# Calculate percentage of votes received by Republicans and Democratspercent_votes_R <- (total_votes_R / total_votes) percent_votes_D <- (total_votes_D / total_votes) # Calculate total counties won by Republicans and Democratstotal_counties_R <- sum(votes_by_county$partywonR)total_counties_D <- nrow(votes_by_county) - total_counties_Rtotal_counties <- nrow(votes_by_county)# Calculate percentage of counties won by Republicans and Democratspercent_counties_R <- (total_counties_R / total_counties) percent_counties_D <- (total_counties_D / total_counties) # Calculate total states won by Republicans and Democratstotal_states_R <- sum(votes_by_state$partywonR)total_states_D <- nrow(votes_by_state) - total_states_Rtotal_states <- nrow(votes_by_state)# Calculate percentage of states won by Republicans and Democratspercent_states_R <- (total_states_R / total_states) percent_states_D <- (total_states_D / total_states) # Calculate total electoral votes won by Republicans and Democratstotal_electoral_votes_R <- sum(votes_by_state$partywonR * votes_by_state$electoral_votes)total_electoral_votes_D <- sum((1 - votes_by_state$partywonR) * votes_by_state$electoral_votes)# Calculate percentage of electoral votes won by Republicans and Democratspercent_electoral_votes_R <- ((total_electoral_votes_R+1) / sum(votes_by_state$electoral_votes)) percent_electoral_votes_D <- ((total_electoral_votes_D-1) / sum(votes_by_state$electoral_votes)) # Create the final summary tablesummary_table <- tibble::tibble( Category = c("Total votes received", "Number of counties won", "Number of states won", "Electoral votes"), Republican = c(total_votes_R, total_counties_R, total_states_R, total_electoral_votes_R + 1), Democratic = c(total_votes_D, total_counties_D, total_states_D, total_electoral_votes_D - 1), `Republican %` = c(percent_votes_R, percent_counties_R, percent_states_R, percent_electoral_votes_R), `Democratic %` = c(percent_votes_D, percent_counties_D, percent_states_D, percent_electoral_votes_D), Total = c(total_votes, total_counties, total_states, sum(votes_by_state$electoral_votes)))# Display the summary tablesummary_table |> mutate(`Republican %` = 100 * `Republican %`, `Democratic %` = 100 * `Democratic %`) |> gt() |> tab_header(title = "Summary of 2016 Presidential Election Results") |> tab_spanner( label = "Votes Received", columns = vars(Republican, Democratic) ) |> tab_spanner( label = "Votes Share", columns = vars(`Republican %`, `Democratic %`) ) |> fmt_number( columns = vars(Republican, Democratic, Total), decimals = 0, use_seps = TRUE ) |> fmt_number( columns = vars(`Republican %`, `Democratic %`), decimals = 2)```There are in total of 238,828,200 registered votes, with turnout rate of 54.8%, meaning 54.8% of people age 18 or above voted in this election. The Democratic party received almost 3 million more votes than the Republican party, with 65,840,274 votes (51.11%) compared to 62,977,826 votes (48.89%). Despite this, the Republican party won a significantly higher number of counties, with 2,634 counties (83.86%) favoring them, while only 507 counties (16.14%) favored the Democratic party. In terms of state victories, the Republicans secured 30 states (58.82%), while the Democrats won 21 states (41.18%). This resulted in the Republican party gaining more electoral votes, with 306 (56.69%) compared to the Democratic party's 232 electoral votes (43.31%), ultimately leading to their victory in the 2016 Presidential Election.```{r}votes_by_state <-ungroup(votes_by_state)votes_by_state |> dplyr::select(state, votesR, votesD, fracR, fracD, winning_party) |>mutate(fracR =100* fracR, fracD =100* fracD) |>gt() |>tab_header(title ="2016 Presidential Election Voting Results by State") |>tab_spanner(label ="Votes Received",columns =vars(votesR, votesD) ) |>tab_spanner(label ="Vote Share",columns =vars(fracR, fracD) ) |>cols_label(state ="State",votesR ="Republican",votesD ="Democratic",fracR ="Republican %",fracD ="Democratic %",winning_party ="Winner" ) |>fmt_number(columns =vars( fracR, fracD),decimals =2 ) |>fmt_number(columns =vars(votesR, votesD),decimals =0,use_seps =TRUE )```First, let's see what counties has the highest Republican vote share and Democratic vote share.```{r}#| label: tbl-top5R#| tbl_cap: Top 5 Counties with Highest Republican Vote Sharevotes_by_county |>arrange(desc(fracR)) |> dplyr::select(county, state, fracR) |>slice_head(n =5) |>gt() |>tab_header(title ="Top 5 Counties with Highest Republican Vote Share") |>cols_label(state ="State",county ="County",fracR ="Republican Vote Share" ) |>fmt_number(columns =vars(fracR),decimals =3 )```4 out of the 5 counties with highest Republican vote share are in Texas.```{r}#| label: tbl-top5D#| tbl_cap: Top 5 Counties with Highest Democratic Vote Sharevotes_by_county |>arrange(desc(fracD)) |>select(county, state, fracD) |>slice_head(n =5) |>gt() |>tab_header(title ="Top 5 Counties with Highest Democratic Vote Share") |>cols_label(state ="State",county ="County",fracD ="Democratic Vote Share" ) |>fmt_number(columns =vars(fracD),decimals =3 )``````{r}library(dplyr)# Ensure winning_party is a factor for easier aggregationmodel_data <- merged_data_county %>%mutate(winning_party =as.factor(winning_party))# Multiply the percentage variables by the corresponding populationmodel_data <- model_data %>%mutate(EDU685213_weighted = EDU685213 * PST045214,RHI825214_weighted = RHI825214 * PST045214,POP645213_weighted = POP645213 * PST045214, SEX255214_weighted = SEX255214 * PST045214 )# Aggregate the data by winning_partyaggregated_data <- model_data %>%group_by(winning_party) %>%summarise(total_population =sum(PST045214, na.rm =TRUE),HSG495213_avg =mean(HSG495213, na.rm =TRUE),INC910213_avg =mean(INC910213, na.rm =TRUE),EDU685213_weighted_sum =sum(EDU685213_weighted, na.rm =TRUE),RHI825214_weighted_sum =sum(RHI825214_weighted, na.rm =TRUE),POP645213_weighted_sum =sum(POP645213_weighted, na.rm =TRUE), SEX255214_weighted_sum =sum(SEX255214_weighted, na.rm =TRUE) )# Calculate the average values for the percentage variablesaggregated_data <- aggregated_data %>%mutate(EDU685213_avg = EDU685213_weighted_sum / total_population,RHI825214_avg = RHI825214_weighted_sum / total_population,POP645213_avg = POP645213_weighted_sum / total_population, SEX255214_avg = SEX255214_weighted_sum / total_population ) %>% dplyr::select( winning_party, HSG495213_avg, INC910213_avg, EDU685213_avg, RHI825214_avg, POP645213_avg, SEX255214_avg )# Print the summarized tableaggregated_data |>gt() |>tab_header(title ="Average Socio-Economic Indicators of Counties by Favouring Party") |>cols_label(winning_party ="Favouring Party",HSG495213_avg ="Median House Value",INC910213_avg ="Income per capita",EDU685213_avg ="Bachelor's Degree Pct",RHI825214_avg ="White Population Pct",POP645213_avg ="Population Density", SEX255214_avg ="Women Pct" ) |>fmt_number(columns =vars(HSG495213_avg, INC910213_avg, EDU685213_avg, RHI825214_avg, POP645213_avg, SEX255214_avg),decimals =2 )``````{r}library(dplyr)# Ensure winning_party is a factor for easier aggregationmodel_data <- merged_data_state %>%mutate(winning_party =as.factor(winning_party))# Check for any NAs in winning_partyprint(sum(is.na(model_data$winning_party)))# Print unique values in winning_partyprint(unique(model_data$winning_party))# Multiply the percentage variables by the corresponding populationmodel_data <- model_data %>%mutate(EDU685213_weighted = EDU685213 * PST045214,RHI825214_weighted = RHI825214 * PST045214,POP645213_weighted = POP645213 * PST045214, SEX255214_weighted = SEX255214 * PST045214 )# Check for any infinite valuesprint(colSums(sapply(model_data, is.infinite)))# Aggregate the data by winning_partyaggregated_data <- model_data %>%group_by(winning_party) %>%summarise(total_population =sum(PST045214, na.rm =TRUE),HSG495213_avg =mean(HSG495213, na.rm =TRUE),INC910213_avg =mean(INC910213, na.rm =TRUE),EDU685213_weighted_sum =sum(EDU685213_weighted, na.rm =TRUE),RHI825214_weighted_sum =sum(RHI825214_weighted, na.rm =TRUE),POP645213_weighted_sum =sum(POP645213_weighted, na.rm =TRUE), SEX255214_weighted_sum =sum(SEX255214_weighted, na.rm =TRUE) )# Calculate the average values for the percentage variablesaggregated_data <- aggregated_data %>%mutate(EDU685213_avg = EDU685213_weighted_sum / total_population,RHI825214_avg = RHI825214_weighted_sum / total_population,POP645213_avg = POP645213_weighted_sum / total_population, SEX255214_avg = SEX255214_weighted_sum / total_population ) %>% dplyr::select( winning_party, HSG495213_avg, INC910213_avg, EDU685213_avg, RHI825214_avg, POP645213_avg, SEX255214_avg )# Print the summarized tableaggregated_data %>%gt() %>%tab_header(title ="Average Socio-Economic Indicators of States by Favouring Party") %>%cols_label(winning_party ="Favouring Party",HSG495213_avg ="Median House Value",INC910213_avg ="Income per capita",EDU685213_avg ="Bachelor's Degree Pct",RHI825214_avg ="White Population Pct",POP645213_avg ="Population Density", SEX255214_avg ="Women Pct" ) %>%fmt_number(columns =vars(HSG495213_avg, INC910213_avg, EDU685213_avg, RHI825214_avg, POP645213_avg, SEX255214_avg),decimals =2 )``````{r}# Ensure winning_party is a factor for easier aggregationmodel_data <- merged_data_state %>%mutate(winning_party =as.factor(winning_party))# Calculate weighted values for percentage variablespercentage_vars <-c("AGE135214", "AGE295214", "AGE775214", "SEX255214", "RHI125214", "RHI225214", "RHI325214", "RHI425214", "RHI525214", "RHI625214", "RHI725214", "RHI825214", "POP715213", "POP645213", "POP815213", "EDU635213", "EDU685213", "HSG096213", "PVY020213", "SBO315207", "SBO115207", "SBO215207", "SBO515207", "SBO415207", "SBO015207")model_data <- model_data %>%mutate(across(all_of(percentage_vars), ~ . * PST045214, .names ="{.col}_weighted"))# Aggregate the data by winning_partyaggregated_data <- model_data %>%group_by(winning_party) %>%summarise(total_population =sum(PST045214, na.rm =TRUE),across(c(HSG495213, INC910213, INC110213, MAN450207, WTN220207, RTN130207, RTN131207, AFN120207, BPS030214), ~mean(., na.rm =TRUE), .names ="{.col}_avg"),across(ends_with("_weighted"), ~sum(., na.rm =TRUE), .names ="{.col}_sum"),across(c(PST120214, VET605213, LFE305213, HSG010214, HSG445213, HSD410213, HSD310213, BZA010213, BZA110213, BZA115213, NES010213, SBO001207, LND110210, POP060210), ~mean(., na.rm =TRUE), .names ="{.col}_avg") )# Calculate the average values for the percentage variablesaggregated_data <- aggregated_data %>%mutate(across(ends_with("_weighted_sum"), ~ . / total_population, .names ="{str_remove(.col, '_weighted_sum')}_avg"))# Select and rename columnsaggregated_data <- aggregated_data %>% dplyr::select(winning_party, ends_with("_avg"), total_population) %>%rename_with(~str_remove(.x, "_avg"), ends_with("_avg"))# Print the summarized tableaggregated_data %>%gt() %>%tab_header(title ="Average Socio-Economic Indicators of Counties by Favouring Party") %>%fmt_number(columns =-winning_party, decimals =2)``````{r}# Ensure winning_party is a factor for easier aggregationmodel_data <- merged_data_county %>%mutate(winning_party =as.factor(winning_party))# Calculate weighted values for percentage variablespercentage_vars <-c("AGE135214", "AGE295214", "AGE775214", "SEX255214", "RHI125214", "RHI225214", "RHI325214", "RHI425214", "RHI525214", "RHI625214", "RHI725214", "RHI825214", "POP715213", "POP645213", "POP815213", "EDU635213", "EDU685213", "HSG096213", "PVY020213", "SBO315207", "SBO115207", "SBO215207", "SBO515207", "SBO415207", "SBO015207")model_data <- model_data %>%mutate(across(all_of(percentage_vars), ~ . * PST045214, .names ="{.col}_weighted"))# Aggregate the data by winning_partyaggregated_data <- model_data %>%group_by(winning_party) %>%summarise(total_population =sum(PST045214, na.rm =TRUE),across(c(HSG495213, INC910213, INC110213, MAN450207, WTN220207, RTN130207, RTN131207, AFN120207, BPS030214), ~mean(., na.rm =TRUE), .names ="{.col}_avg"),across(ends_with("_weighted"), ~sum(., na.rm =TRUE), .names ="{.col}_sum"),across(c(PST120214, VET605213, LFE305213, HSG010214, HSG445213, HSD410213, HSD310213, BZA010213, BZA110213, BZA115213, NES010213, SBO001207, LND110210, POP060210), ~mean(., na.rm =TRUE), .names ="{.col}_avg") )# Calculate the average values for the percentage variablesaggregated_data <- aggregated_data %>%mutate(across(ends_with("_weighted_sum"), ~ . / total_population, .names ="{str_remove(.col, '_weighted_sum')}_avg"))# Select and rename columnsaggregated_data <- aggregated_data %>% dplyr::select(winning_party, ends_with("_avg"), total_population) %>%rename_with(~str_remove(.x, "_avg"), ends_with("_avg"))# Print the summarized tableaggregated_data %>%gt() %>%tab_header(title ="Average Socio-Economic Indicators of Counties by Favouring Party") %>%fmt_number(columns =-winning_party, decimals =2)```### Visualisation {#sec-eda-visualisation}#### Win margin for state```{r}#| label: fig-vote-share-state#| fig_cap: Vote Share by State for 2016 Presidential Electionlibrary(scales)library(ggplot2)library(dplyr)library(tidyr)# Prepare the dataplot_data <- votes_by_state %>%select(state, fracR, fracD) %>%pivot_longer(cols =c(fracR, fracD), names_to ="party", values_to ="vote_share") %>%mutate(party =ifelse(party =="fracR", "Republican", "Democratic"))# Create the plotggplot(plot_data, aes(y =reorder(state, vote_share), x = vote_share, fill = party)) +geom_bar(stat ="identity", position ="fill", width =0.8) +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +scale_x_continuous(labels =percent_format()) +labs(title ="2016 Presidential Election Vote Share by State",y ="State",x ="Vote Share",fill ="Party") +geom_vline(xintercept =0.5, linetype ="dashed", color ="black") +theme_minimal() +theme(axis.text.y =element_text(size =8),legend.position ="bottom")```#### winning party map```{r}#| label: fig-winners-state#| fig_cap: Winners by State for 2016 Presidential Election#| # Visualise the winning party at the state level on a US map# Prepare the datavotes_by_state_map <- votes_by_state %>%mutate(winning_party_color =ifelse(winning_party =="Republican", "red", "blue"),color_intensity = scales::rescale(frac_diff, to =c(0.3, 2)))# Function to generate color based on party and intensityget_color <-function(party, intensity) {if (party =="red") {return(scales::alpha("red", intensity)) } else {return(scales::alpha("blue", intensity)) }}# Apply the function to get the colorvotes_by_state_map <- votes_by_state_map %>%mutate(color =mapply(get_color, winning_party_color, color_intensity))# Create a dummy variable for legendvotes_by_state_map$legend_party <-ifelse(votes_by_state_map$winning_party =="Republican", "Republican", "Democratic")# Plottingplot_usmap(data = votes_by_state_map, values ="color", color ="white", labels =TRUE) +scale_fill_identity(name ="Winning Party",breaks =c("red", "blue"),labels =c("Republican", "Democratic"),guide =guide_legend(override.aes =list(fill =c("red", "blue")))) +labs(title ="2016 Presidential Election Results by State",fill ="Winning Party") +theme(legend.position ="right",legend.title =element_text(size =8),legend.text =element_text(size =7))```There's seems to be correlation between geographical location and voting pattern. For example, the central states tend to vote Republican, while the coastal states and some northern states tend to vote Democratic.Divide states into regions and plot the voting results by region.```{r}# Create a mapping of states to regions# Define the state, region, and division mappingsstates <-c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont","New Jersey", "New York", "Pennsylvania","Indiana", "Illinois", "Michigan", "Ohio", "Wisconsin","Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota","Delaware", "District of Columbia", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "West Virginia","Alabama", "Kentucky", "Mississippi", "Tennessee","Arkansas", "Louisiana", "Oklahoma", "Texas","Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming","Alaska", "California", "Hawaii", "Oregon", "Washington")regions <-c(rep("Northeast", 9),rep("Midwest", 12),rep("South", 17),rep("West", 13))divisions <-c(rep("New England", 6), rep("Middle Atlantic", 3),rep("East North Central", 5), rep("West North Central", 7),rep("South Atlantic", 9), rep("East South Central", 4), rep("West South Central", 4),rep("Mountain", 8), rep("Pacific", 5))# Create the dataframestate_info <-data.frame(state = states,region = regions,division = divisions,stringsAsFactors =FALSE)# Print the dataframeprint(state_info)# Add the region column to the votes_by_county dataframevotes_by_region <- votes_by_county %>%left_join(state_info, by =c("state"="state"))``````{r}average_vote_share_region <- votes_by_region %>%group_by(region) %>%summarise(R_votes =sum(votesR, na.rm =TRUE),D_votes =sum(votesD, na.rm =TRUE),fracR = R_votes / (R_votes + D_votes),fracD = D_votes / (R_votes + D_votes) )# Print the tableaverage_vote_share_region |>select(region, fracR, fracD) |>gt() |>tab_header(title ="Average Vote Share by Region") |>cols_label(region ="Region",fracR ="Republican Vote Share",fracD ="Democratic Vote Share" ) |>fmt_number(columns =vars(fracR, fracD),decimals =2 )``````{r}average_vote_share_division <- votes_by_region %>%group_by(division) %>%summarise(R_votes =sum(votesR, na.rm =TRUE),D_votes =sum(votesD, na.rm =TRUE),fracR = R_votes / (R_votes + D_votes),fracD = D_votes / (R_votes + D_votes) )# Print the tableaverage_vote_share_division |>select(division, fracR, fracD) |>gt() |>tab_header(title ="Average Vote Share by Division") |>cols_label(division ="Division",fracR ="Republican Vote Share",fracD ="Democratic Vote Share" ) |>fmt_number(columns =vars(fracR, fracD),decimals =2 )``````{r}# Assuming votes_by_county already has the 'region' column added as beforevotes_by_state_region <- votes_by_region %>%group_by(state, region) %>%summarise(votesR =sum(votesR, na.rm =TRUE),votesD =sum(votesD, na.rm =TRUE) ) %>%mutate(total_votes = votesR + votesD,point_margin = (votesR - votesD) / total_votes *100,winning_party =if_else(point_margin >0, "Republican", "Democratic") ) %>%ungroup()# Plottingggplot(votes_by_state_region, aes(x = point_margin, y =reorder(state, point_margin), color = winning_party)) +geom_point(size =3) +facet_wrap(~ region, scales ="free_y", ncol =1) +geom_vline(xintercept =0, linetype ="solid", color ="black") +scale_color_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="2016 US Presidential Election Results by Region",x ="Point Margin",y ="",color ="Winning Party" ) +theme_minimal() +theme(axis.text.y =element_text(size =8),strip.text =element_text(size =10, face ="bold"),legend.position ="bottom")```Midwest region tends to vote Republican, while the Northeast regions tend to vote Democratic. Now let's further divide states into divisions and plot the voting results by division.```{r}# Assuming votes_by_county already has the 'region' column added as beforevotes_by_state_division <- votes_by_region %>%group_by(state, division) %>%summarise(votesR =sum(votesR, na.rm =TRUE),votesD =sum(votesD, na.rm =TRUE) ) %>%mutate(total_votes = votesR + votesD,point_margin = (votesR - votesD) / total_votes *100,winning_party =if_else(point_margin >0, "Republican", "Democratic") ) %>%ungroup()# Plottingggplot(votes_by_state_division, aes(x = point_margin, y =reorder(state, point_margin), color = winning_party)) +geom_point(size =3) +facet_wrap(~ division, scales ="free_y", ncol =3) +geom_vline(xintercept =0, linetype ="solid", color ="black") +scale_color_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="2016 US Presidential Election Results by Division",x ="Point Margin",y ="",color ="Winning Party" ) +theme_minimal() +theme(axis.text.y =element_text(size =8),strip.text =element_text(size =10, face ="bold"),legend.position ="bottom")```There does seem to be correlation between geographical location and voting pattern. For example, East North Central, East South Central, and West South Central regions tend to vote Republican, while the Pacific and New England regions tend to vote Democratic. However, there still exists divions where the voting pattern is not as clear-cut, such as South Atlantic and Mountain regions.This suggests that adding a new variable for region or division could be useful in predicting voting outcomes based on socio-economic indicators.Further into county level, we will plot the voting results by county.```{r}library(sf)library(usmap)library(tigris)options(tigris_use_cache =TRUE)# import the county shapefile for year 2016county_shp <-counties(year =2016)# Extract Connecticut data from county_shpct_data <- county_shp %>%filter(STATEFP %in%c("09")) %>%st_transform(crs =st_crs(4326)) %>%st_simplify(dTolerance =0.01)# Add FIPS code and rename columns for Connecticutct_data <- ct_data %>%mutate(fips =paste0(STATEFP, COUNTYFP),abbr =case_when( STATEFP =="09"~"CT" ),full =case_when( STATEFP =="09"~"Connecticut" ),county = NAMELSAD ) %>%select(fips, abbr, full, county, geom = geometry) %>%# Rearrange columnsusmap_transform() # Ensure the transformation is suitable for usmapus_counties <-us_map(regions ="counties")# Removing existing Connecticut data from us_countiesus_counties_removed <- us_counties %>%filter(!abbr %in%c("CT"))# Add the modified Connecticut data to us_countiesus_counties_merged <-bind_rows(us_counties_removed, ct_data)saveRDS(us_counties_merged, file ="modified_us_counties.rds")votes_by_county_map <- votes_by_county %>%mutate(fips =as.character(fips), # Convert to characterfips =if_else(str_length(fips) ==4, str_c("0", fips), fips)) # Add leading zero if length is 4# Return names of counties that are in the votes_by_county_map but not in us_counties_mergedmissing_counties1 <-setdiff(votes_by_county_map$fips, us_counties_merged$fips)print(votes_by_county_map |>filter(fips %in% missing_counties1))# Return names of counties that are in the us_counties_merged but not in votes_by_county_mapmissing_counties2 <-setdiff(us_counties_merged$fips, votes_by_county_map$fips)print(us_counties_merged |>filter(fips %in% missing_counties2))votes_by_county_map <- us_counties_merged %>%left_join(votes_by_county_map, by ='fips')# Get state bordersstate_borders <-us_map(regions ="states")# Get state centroid labelsstate_labels <- usmapdata::centroid_labels("states")# Plot the voting resultsggplot(votes_by_county_map) +geom_sf(aes(fill = winning_party, geometry = geom, alpha = frac_diff), color ='black') +geom_sf(data = state_borders, fill =NA, color ='white', size =1) +# Add state bordersgeom_sf_text(data = state_labels, aes(label = abbr, geometry = geom), size =3, color ="white", fontface ="bold") +# Add state abbreviationsscale_fill_manual(values =c("Republican"="red", "Democratic"="blue")) +scale_alpha_continuous(range =c(0.4, 1), guide ="none") +labs(title ="2016 US Presidential Election Results by County",fill ="Winning Party",alpha ="Vote Share Difference") +theme_minimal() +theme(axis.text =element_blank(),axis.title =element_blank(),panel.grid =element_blank())```After looking up information on the Alaska government website, we found that in 2019, the Valdez-Cordova county was ablished and replaced by the Chugach Census Area and the Copper River Census Area. Also, in 2015, the Wade Hampton Census Area was renamed to the Kusilvak Census Area, and the fips code was changed from 02270 to 02158. We will deal with this later.Histograms of all numerical variables at the county level```{r}# Reshape data to long format for easy plottingcounty_facts_long <- county_facts %>%select(-fips, -state_abbreviation, -county) %>%pivot_longer(everything(), names_to ="variable", values_to ="value")# Plot histograms for all numerical variablesggplot(county_facts_long, aes(x = value)) +geom_histogram(bins =30, fill ="blue", color ="black") +facet_wrap(~ variable, scales ="free_x") +labs(title ="Histograms of All Numerical Variables - County Level",x ="Value",y ="Frequency") +theme_minimal()``````{r}ggplot(votes_by_county, aes(x = fracR)) +geom_histogram(bins =30, fill ="red", color ="black") +labs(title ="Histogram of Republican Vote Share",x ="Republican Vote Share",y ="Frequency") +theme_minimal()```From the histograms, we can see that most of the predictors are right-skewed, indicating that the majority of counties have lower values for these indicators. This is expected as socio-economic indicators such as median household income, educational attainment, and racial demographics tend to vary significantly across counties.This also suggests that data transformation methods such as logarithm transformation may be necessary to address the skewness in the variables.Looking at the response variable fracR, we can see that the distribution is not symmetric, with a peak around 0.74. This indicates that the Republican vote share is higher in most counties.```{r}# Reshape data to long format for easy plottingstate_facts_long <- state_facts %>%select_if(is.numeric) %>%pivot_longer(everything(), names_to ="variable", values_to ="value")# Plot histograms for all numerical variablesggplot(state_facts_long, aes(x = value)) +geom_histogram(bins =30, fill ="blue", color ="black") +facet_wrap(~ variable, scales ="free_x") +labs(title ="Histograms of All Numerical Variables - State Level",x ="Value",y ="Frequency") +theme_minimal()``````{r}ggplot(votes_by_state, aes(x = fracR)) +geom_histogram(bins =30, fill ="red", color ="black") +labs(title ="Histogram of Republican Vote Share",x ="Republican Vote Share",y ="Frequency") +theme_minimal()```Compared to histograms for county level, there is reduced but still significant skewness for most indicators at the state level.```{r}# Preparation for merging demographic socio-economic indicators with county geo data# add leading 0 to fips with 4 digitscounty_facts_map <- county_facts |>mutate(fips =as.character(fips),fips =if_else(str_length(fips) ==4, str_c("0", fips), fips))# merge dataset county_facts and us_counties by fipscounty_facts_map <- us_counties_merged |>left_join(county_facts_map, by =c("fips"="fips")) |>select(-county.y, -state_abbreviation) |>rename(county = county.x)``````{r}# Load necessary librarieslibrary(ggplot2)library(dplyr)library(viridis)library(sf)library(gridExtra)# Prepare the list of indicators and their corresponding column namesindicators <-list(population ="PST045214",white_percentage ="RHI825214",foreign_percentage ="POP645213",bachelors_percentage ="EDU685213",median_household_income ="INC110213")# Create a function to generate choropleth maps for each indicator with log transformationplot_choropleth <-function(data, indicator, title, reduce_ticks =FALSE) { transformed_data <- data %>%mutate(transformed_value =log1p(!!sym(indicator))) max_value <-max(transformed_data$transformed_value, na.rm =TRUE) min_value <-min(transformed_data$transformed_value, na.rm =TRUE)# Adjust breaks for population plot to reduce tick labelsif (reduce_ticks) { breaks <-pretty(c(min_value, max_value), n =3) } else { breaks <-seq(ceiling(min_value), floor(max_value), by =1) } labels <-expm1(breaks) # Reverse the log1p transformationggplot(transformed_data) +geom_sf(aes(fill = transformed_value), color ="black") +scale_fill_viridis_c(option ="viridis", na.value ="grey50", breaks = breaks, labels =round(labels, 0)) +geom_sf(data = state_borders, fill =NA, color ='black', size =1) +# Add state bordersgeom_sf_text(data = state_labels, aes(label = abbr, geometry = geom), size =3, color ="white", fontface ="bold") +# Add state abbreviationslabs(title = title, fill = title) +theme_minimal() +theme(axis.text =element_blank(),axis.title =element_blank(),panel.grid =element_blank(),legend.position ="bottom")}# Assuming county_facts_map is your spatial dataset# Generate plots for each indicatorplots <-lapply(names(indicators), function(indicator) {if (indicator =="population") {plot_choropleth(county_facts_map, indicators[[indicator]], indicator, reduce_ticks =TRUE) } else {plot_choropleth(county_facts_map, indicators[[indicator]], indicator) }})# Display the plotsprint(plots)``````{r}# Merge votes_by_county with county_facts by fipsmerged_data_county <-left_join(votes_by_county, county_facts, by ="fips")merged_data_county <- merged_data_county |>filter(complete.cases(county.y)) |>select(-county.y, -state_abbreviation) |>rename(county = county.x)# Change the state name from District Of Columbia to District of Columbia in table state_factsstate_facts$state <-ifelse(state_facts$state =="District Of Columbia", "District of Columbia", state_facts$state)merged_data_state <-merge(votes_by_state, state_facts, by ="state")```We noticed that Kansas City, Missouri is missing in the `county_facts` dataset. We will remove this row from the merged dataset.Now, let's look at the averaged socio-economic indicators at the county level for each party. We will first start with population:```{r}# Calculate the average population for each partyaverage_population <- merged_data_county %>%group_by(winning_party) %>%summarise(average_population_2014 =mean(PST045214, na.rm =TRUE))# Create the side-by-side bar plot for average populationplot_population <-ggplot(average_population, aes(x = winning_party, y = average_population_2014, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="County Average Population \n by Winning Party",x ="Winning Party",y ="Average Population",fill ="Winning Party") +theme_minimal()# Calculate the average population density for each partyaverage_density <- merged_data_county %>%group_by(winning_party) %>%summarise(average_density_2014 =mean(POP060210, na.rm =TRUE))# Create the side-by-side bar plot for average population densityplot_density <-ggplot(average_density, aes(x = winning_party, y = average_density_2014, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="County Average Population Density\n by Winning Party",x ="Winning Party",y ="Average Population Density",fill ="Winning Party") +theme_minimal()# Calculate the average land area for each partyaverage_land_area <- merged_data_county %>%group_by(winning_party) %>%summarise(average_land_area =mean(LND110210, na.rm =TRUE))# Create the side-by-side bar plot for average land areaplot_land_area <-ggplot(average_land_area, aes(x = winning_party, y = average_land_area, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="County Average Land Area \n by Winning Party",x ="Winning Party",y ="Average Land Area",fill ="Winning Party") +theme_minimal()# Adjust the title text size for each plotplot_population <- plot_population +theme(plot.title =element_text(size =10))plot_density <- plot_density +theme(plot.title =element_text(size =10))plot_land_area <- plot_land_area +theme(plot.title =element_text(size =10))# Arrange the three plots side by sidegrid.arrange(plot_population, plot_land_area, plot_density, ncol =3)```Demographic:```{r}# Define the demo_pct_columnsdemo_pct_columns <-c("AGE135214", "AGE295214", "AGE775214", "SEX255214", "RHI125214","RHI225214", "RHI325214", "RHI425214", "RHI525214", "RHI625214","RHI725214", "RHI825214", "POP715213", "POP645213", "POP815213")# Step 2: Calculate weighted averages for the demo_pct_columns by partyweighted_averages <- merged_data_county %>%mutate(population = PST045214) %>%group_by(winning_party) %>%summarise(across(all_of(demo_pct_columns), ~sum(.x * population, na.rm =TRUE) /sum(population, na.rm =TRUE), .names ="weighted_avg_{col}"))# Transform to long format for plottinglong_data <- weighted_averages %>%pivot_longer(cols =starts_with("weighted_avg"), names_to ="indicator", values_to ="value") %>%mutate(indicator =str_remove(indicator, "weighted_avg_")) %>%left_join(dictionary, by =c("indicator"="column_name")) %>%pivot_wider(names_from = winning_party, values_from = value) %>%pivot_longer(cols =c(Democratic, Republican), names_to ="Party", values_to ="Value")# Create the horizontal side-by-side bar chartggplot(long_data, aes(x =reorder(description, indicator), y = Value, fill = Party)) +geom_bar(stat ="identity", position ="dodge") +coord_flip() +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="Average Indicator Values by Party",x ="Indicator",y ="Average Value",fill ="Party") +theme_minimal() +theme(axis.text.x =element_text(hjust =1))```Educational Attainment```{r}library(gridExtra)# Calculate the average population for each partyhs_pct <- merged_data_county %>%group_by(winning_party) %>%summarise(average_hs =mean(EDU635213, na.rm =TRUE))# Create the side-by-side bar plot for average populationplot_hs <-ggplot(hs_pct, aes(x = winning_party, y = average_hs, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="High School Graduate Percentage by Winning Party",x ="Winning Party",y ="Average Percentage",fill ="Winning Party") +theme_minimal()# Calculate the average population density for each partycollege_pct <- merged_data_county %>%group_by(winning_party) %>%summarise(average_college =mean(EDU685213, na.rm =TRUE))# Create the side-by-side bar plot for average population densityplot_college <-ggplot(college_pct, aes(x = winning_party, y = average_college, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="Bachelor's degree or higher pct by Winning Party",x ="Winning Party",y ="Average Percentage",fill ="Winning Party") +theme_minimal()# Arrange the two plots side by sidegrid.arrange(plot_hs, plot_college, ncol =2)```Housing```{r}# Calculate simple averages for non-percentage attributessimple_averages <- merged_data_county %>%group_by(winning_party) %>%summarise(across(c("HSG010214", "HSG495213", "HSD410213"), ~mean(.x, na.rm =TRUE), .names ="avg_{col}"))# Transform to long format for plottinglong_data <- simple_averages %>%pivot_longer(cols =starts_with("weighted_avg_") |starts_with("avg_"), names_to ="indicator", values_to ="value") %>%mutate(indicator =str_remove(indicator, "weighted_avg_|avg_")) %>%left_join(dictionary, by =c("indicator"="column_name")) %>%pivot_wider(names_from = winning_party, values_from = value) %>%pivot_longer(cols =c(Democratic, Republican), names_to ="Party", values_to ="Value")# Create the horizontal side-by-side bar chartggplot(long_data, aes(x =reorder(description, Value), y = Value, fill = Party)) +geom_bar(stat ="identity", position ="dodge") +coord_flip() +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title ="Average Housing Indicator Values by Party",x ="Indicator",y ="Average Value",fill ="Party") +theme_minimal() +theme(axis.text.x =element_text(hjust =1))```housing-2```{r}# Columns to be usedhousing_columns <-c("HSG445213", "HSG096213", "HSG010214", "HSG495213", "HSD410213", "HSD310213", "BPS030214", "LFE305213")# Calculate weighted averages for percentage columns and simple average for the other columnweighted_averages <- merged_data_county %>%mutate(population = PST045214) %>%group_by(winning_party) %>%summarise(weighted_avg_HSG445213 =sum(HSG445213 * population, na.rm =TRUE) /sum(population, na.rm =TRUE),weighted_avg_HSG096213 =sum(HSG096213 * population, na.rm =TRUE) /sum(population, na.rm =TRUE),avg_HSD310213 =mean(HSD310213, na.rm =TRUE), avg_HSG010214 =mean(HSG010214, na.rm =TRUE),avg_HSG495213 =mean(HSG495213, na.rm =TRUE),avg_HSD410213 =mean(HSD410213, na.rm =TRUE),avg_BPS030214 =mean(BPS030214, na.rm =TRUE), avg_LFE305213 =mean(LFE305213, na.rm =TRUE) )# Transform to long format for plottinglong_data <- weighted_averages %>%pivot_longer(cols =starts_with("weighted_avg") |starts_with("avg"), names_to ="indicator", values_to ="value") %>%mutate(indicator =case_when( indicator =="weighted_avg_HSG445213"~"HSG445213: Homeownership rate", indicator =="weighted_avg_HSG096213"~"HSG096213: Housing units in multi-unit structures", indicator =="avg_HSD310213"~"HSD310213: Persons per household", indicator =="avg_HSG010214"~"HSG010214: Median value of owner-occupied housing units", indicator =="avg_HSG495213"~"HSG495213: Median selected monthly owner costs", indicator =="avg_HSD410213"~"HSD410213: Median gross rent", indicator =="avg_BPS030214"~"BPS030214: Building permits",TRUE~ indicator ))# Create the plotsplot_list <- long_data %>%split(.$indicator) %>%lapply(function(data) {ggplot(data, aes(x = winning_party, y = value, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title =paste("Average", data$indicator[1], "by Winning Party"),x ="Winning Party",y = data$indicator[1],fill ="Winning Party") +theme_minimal() })# Display the plots in a single viewlibrary(gridExtra)do.call(grid.arrange, c(plot_list, ncol =2))```Employment-1```{r}# Load necessary librarieslibrary(dplyr)library(ggplot2)library(tidyr)library(gridExtra)# Columns to be usedbusiness_columns <-c("BZA010213", "BZA110213", "NES010213", "SBO001207")# Calculate averages for the business columnsaverages <- merged_data_county %>%group_by(winning_party) %>%summarise(across(all_of(business_columns), mean, na.rm =TRUE, .names ="avg_{col}"))# Transform to long format for plottinglong_data <- averages %>%pivot_longer(cols =starts_with("avg"), names_to ="indicator", values_to ="value") %>%mutate(indicator =case_when( indicator =="avg_BZA010213"~"BZA010213: Number of nonemployer establishments", indicator =="avg_BZA110213"~"BZA110213: Number of employer establishments", indicator =="avg_BZA115213"~"BZA115213: Employer establishments, percent change", indicator =="avg_NES010213"~"NES010213: Number of employees", indicator =="avg_SBO001207"~"SBO001207: Number of firms",TRUE~ indicator ))# Create the plotsplot_list <- long_data %>%split(.$indicator) %>%lapply(function(data) {ggplot(data, aes(x = winning_party, y = value, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title = data$indicator[1],x ="Winning Party",y ="Average Value",fill ="Winning Party") +theme_minimal() })# Display the plots in a single viewdo.call(grid.arrange, c(plot_list, ncol =2))```Employment-2```{r}# Load necessary librarieslibrary(dplyr)library(ggplot2)library(tidyr)library(gridExtra)# Columns to be usedbusiness_columns <-c("BZA115213", "SBO315207", "SBO115207", "SBO215207", "SBO515207", "SBO415207", "SBO015207")# Calculate weighted averages for the business columnsweighted_averages <- merged_data_county %>%mutate(population = PST045214) %>%group_by(winning_party) %>%summarise(across(all_of(business_columns), ~sum(.x * population, na.rm =TRUE) /sum(population, na.rm =TRUE), .names ="weighted_avg_{col}"))# Transform to long format for plottinglong_data <- weighted_averages %>%pivot_longer(cols =starts_with("weighted_avg"), names_to ="indicator", values_to ="value") %>%mutate(indicator =str_remove(indicator, "weighted_avg_")) %>%mutate(description =case_when( indicator =="BZA115213"~"Employer establishments, percent change", indicator =="SBO315207"~"Black-owned firms, percent", indicator =="SBO115207"~"American Indian- and Alaska Native-owned firms, percent", indicator =="SBO215207"~"Asian-owned firms, percent", indicator =="SBO515207"~"Native Hawaiian- and Other Pacific Islander-owned firms, percent", indicator =="SBO415207"~"Hispanic-owned firms, percent", indicator =="SBO015207"~"Women-owned firms, percent",TRUE~ indicator ))# Create the plotsplot_list <- long_data %>%split(.$indicator) %>%lapply(function(data) {ggplot(data, aes(x = winning_party, y = value, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title = data$description[1],x ="Winning Party",y ="Weighted Average Value",fill ="Winning Party") +theme_minimal() })# Display the plots in a single viewdo.call(grid.arrange, c(plot_list, ncol =2))```Sales:```{r}# Columns to be usedsales_columns <-c("MAN450207", "WTN220207", "RTN130207", "RTN131207", "AFN120207")# Calculate simple averages for the specified columnssimple_averages <- merged_data_county %>%group_by(winning_party) %>%summarise(across(all_of(sales_columns), ~mean(.x, na.rm =TRUE), .names ="avg_{col}"))# Transform to long format for plottinglong_data <- simple_averages %>%pivot_longer(cols =starts_with("avg"), names_to ="indicator", values_to ="value") %>%mutate(indicator =str_remove(indicator, "avg_")) %>%mutate(description =case_when( indicator =="MAN450207"~"Manufacturers shipments, 2007 ($1,000)", indicator =="WTN220207"~"Wholesale trade sales, 2007 ($1,000)", indicator =="RTN130207"~"Retail sales, 2007 ($1,000)", indicator =="RTN131207"~"Retail sales per capita, 2007", indicator =="AFN120207"~"Accommodation and food services sales, 2007 ($1,000)",TRUE~ indicator ))# Create the plotsplot_list <- long_data %>%split(.$indicator) %>%lapply(function(data) {ggplot(data, aes(x = winning_party, y = value, fill = winning_party)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values =c("Democratic"="blue", "Republican"="red")) +labs(title = data$description[1],x ="Winning Party",y ="Average Value",fill ="Winning Party") +theme_minimal() })# Display the plots in a single viewdo.call(grid.arrange, c(plot_list, ncol =2))```